This  ICCV  paper  is  the  Open  Access  version,  provided  by  the  Computer  Vision  Foundation. 
Except  for  this  watermark,  it  is  identical  to  the  version  available  on  IEEE  Xplore. 


Self-Occlusion  and  Disocclusion  in  Causal  Video  Object  Segmentation 

Yanchao Yang1, Ganesh Sundaramoorthi2, and Stefano Soatto1

1University of California, Los Angeles, USA
2King Abdullah University of Science & Technology (KAUST), Saudi Arabia

yyc8912@g.ucla.edu, ganesh.sundaramoorthi@kaust.edu.sa, soatto@ucla.edu


Abstract 

We propose a method to detect disocclusion in video sequences of
three-dimensional scenes and to partition the disoccluded regions into
objects, defined by coherent deformation corresponding to surfaces in the
scene. Our method infers deformation fields that are piecewise smooth by
construction, without the need for an explicit regularizer and the
associated choice of weight. It then partitions the disoccluded region and
groups its components with objects by leveraging the complementarity of
motion and appearance cues: where appearance changes within an object,
motion can usually be reliably inferred and used for grouping; where
appearance is close to constant, it can be used for grouping directly. We
integrate both cues in an energy minimization framework, incorporate prior
assumptions explicitly into the energy, and propose a numerical scheme.

1.  Introduction 

Persistent tracking of three-dimensional (3D) objects in video presents
long-standing challenges unless the objects are flat [33] or the video is
short [25]. As surfaces move in 3D relative to the viewer, previously
unseen portions of the scene become visible and have to be attributed to
different objects to maintain tracking. Such disocclusion phenomena are the
focus of our investigation.

Consider a camera rotating around a box, as in Fig. 1: both the occluded
and disoccluded regions involve portions of different objects, in this case
just the box and the “background.” Occlusions have been addressed by
[29, 1]. We focus on disocclusions: we determine the disoccluded area
(Sect. 2), partition it, and group each portion with an object (Sect. 3).

Grouping unseen portions of the scene into different objects requires prior
assumptions on their properties. One could assume that the “appearance” or
“texture” of objects is homogeneous (i.e., their reflectance exhibits
spatially stationary statistics) and leverage the similarity of image color
histograms to partition and group disoccluded regions. However, this
assumption often fails, as in Fig. 1. Alternatively, one could assume that
the “apparent motion” of objects is homogeneous (i.e., the deformation
undergone by the image domain is smooth within objects, and discontinuous
across them). However, when objects exhibit “textureless” surfaces (i.e.,
constant reflectance), such a deformation is undetermined and cannot be
used for grouping.

Figure 1. Relative motion between a three-dimensional scene and the camera
(here rotating around the box) causes disocclusion, i.e., regions of the
image domain onto which previously unseen portions of the scene project.
Unless objects in the scene are flat, the disocclusion includes portions of
different objects. Persistent tracking requires detecting the disocclusions
and attributing their components to different objects.

Fortunately, motion and appearance cues are complementary: when one fails
to be informative, the other may be. Leveraging such complementarity is
central to this paper. When the disoccluded region exhibits complex
appearance, motion can be reliably inferred and exploited for grouping.
Otherwise, when the disoccluded region is textureless, photometric
statistics are spatially homogeneous and can be reliably used for grouping.
Of course, both cues can fail if an object has piecewise constant
appearance and the transition happens right at the disocclusion (Fig. 2).
However, these are accidental phenomena that do not persist in long
temporal sequences.

For us, objects are layouts of piecewise smooth and smoothly deforming
surfaces in 3D supporting Lambertian reflection, seen under constant
illumination throughout a video sequence. There can be multiple objects
moving independently, in addition to viewer (or equivalently background)
motion. Under these assumptions, the domain of a video image of a scene can
be partitioned into two types of regions: those that are co-visible, which
under the stated assumptions are a smooth deformation of regions in the
previous frame, and those that are disoccluded, i.e., whose pre-image under
perspective projection is a portion of a surface that was not visible in
the previous frame(s). In addition, occluded regions are subsets of a
region that, in the previous frame, was occupied by an object different
from the current one. These have been addressed by others [1].

Disoccluded regions in a video are the occluded regions in the video played
backwards. Because we eventually aim at real-time closed-loop operation, we
wish to process the data causally. Furthermore, parts of objects can appear
in a frame and disappear in the next, a case which forward-backward sweeps
would not address (Sect. 4). With an abuse of nomenclature, we refer to
“objects” as both the connected surfaces in 3D and the subsets of the (2D)
image domain where they project.

Contributions: To detect disocclusions, we extend the Sobolev framework of
[38] to multiple objects (Sect. 2). This framework naturally encompasses
coarse-to-fine deformation inference without an explicit regularizer and
the associated weighting constant. To partition disoccluded regions and
group them with the various objects, we leverage the complementarity of
motion and appearance cues by introducing a novel data term that
encompasses both (Sect. 3). We derive an efficient numerical scheme and
test it against competing methods on benchmark datasets (Sect. 4).

1.1.  Related  work 

Persistent object tracking in video touches upon a large body of work in
video segmentation (e.g., [14, 19, 36, 16]), tracking (e.g.,
[35, 3, 12, 20]), optical flow (e.g., [15, 6, 7, 39, 26]), and motion
segmentation (e.g., [33, 27]). In dealing with visibility phenomena, our
work relates to occlusion detection. There is a literature on detecting
occluding boundaries from static images or short-baseline video (see [29]
and references therein). Since we tackle persistent tracking, we do not
discuss this further. Our work is related to [1], which partitions the
image domain into (flat) layers like [33], but in a convex optimization
setting after relaxing the $\ell_0$ norm to $\ell_1$. We detect occlusions
without the need for such a relaxation and without the need for
regularization of the deformation field, which can cause over-smoothing in
some regions and under-smoothing in others. Instead, following [38], we
employ a Sobolev approach [28] (see also [9, 4]) to infer deformation
fields that are by construction smooth in a naturally coarse-to-fine
manner. On a short time-scale, such deformation fields are related to
optical flow, which we do not review here, except for when the flow is
partitioned into regions, as in motion segmentation. There, the flow field
is often assumed to be piecewise parametric. Here we allow each component
to be a diffeomorphism to handle articulated and deforming objects without
over-segmenting them. Other motion segmentation approaches perform
clustering of optical flow, often non-causally [23, 14].

Although our goal is segmentation, our method produces diffeomorphic warps,
and relates to diffeomorphic registration, e.g., [4, 32, 10]. We produce a
piecewise diffeomorphism of the image rather than a global diffeomorphism
as in [4, 32, 10], an assumption that breaks under (dis)occlusions. Also,
our warp computation is parameter-free in contrast to [4, 32, 10].

Taylor et al. [30] perform layer segmentation in longer video sequences
leveraging occlusion cues, but do not explicitly address the interplay of
motion and intensity cues in disocclusion. Similarly, [2r ] performs
layered segmentation by grouping. Only intensity cues are used for the
disocclusion in [8, 38].

This work also relates to dense 3D reconstruction of geometry and
photometry [18, 22, 37, 13, 1 ], since an explicit 3D reconstruction of the
scene produces as a side effect a partition of the video into regions.
However, it requires a static scene, and does not address deforming objects
moving independently, which our work addresses.

2.  Sobolev  Warps  and  Occlusions 

We seek to partition the domain $D$ of a time-varying color image
$I_t : D \subset \mathbb{R}^2 \to \mathbb{R}^3$, for $t = 1, 2, \ldots$,
into a collection $\{R_i\}_{i=1}^N$ of regions $R_i$. We omit the time
index hereafter for simplicity. These regions, also called “objects,” move
coherently, as defined next.

The (apparent) motion of each region $R_i$, also referred to as a warp or a
deformation, is defined in the domain of the image $I_t$ as the map
$w_i : R_i \to D$ that transforms $I_{t+1}$ back to $I_t$. Assuming the
scene is Lambertian, illumination is constant, and the image is corrupted
by additive zero-mean Gaussian noise, the maximum-likelihood estimate of
$w_i$ is obtained by minimizing $E_{\text{warp}}(w_i, O_i)$, given by

$E_{\text{warp}} = \int_{R_i \setminus O_i} |I_{t+1}(w_i(x)) - I_t(x)|^2 \, dx + \beta \int_{O_i} dx$,  (1)

where $O_i \subset R_i$ is the (unknown) occluded region that is visible at
time $t$ but not at time $t+1$. Note that, although $w_i$ is defined on all
of $R_i$, the data $I_{t+1}, I_t$ only provide evidence of it in the
co-visible region $R_i \setminus O_i$. To avoid the trivial solution
$O_i = R_i$, which leaves $w_i$ undetermined, we put a penalty on the
occluded area as in [1].

Eq. (1) is reminiscent of many optical flow estimation algorithms
[15, 6, 7, 39], but there are important differences. First, each warp is
restricted to a subset $R_i \subset D$ with no compatibility condition or
relation among the different warps. Second, there is no regularizer for the
warps. Most motion segmentation or optical flow schemes either assume that
each warp belongs to a (small-dimensional) parametric family such as the
group of affine transformations, or impose a penalty on the (piecewise)
smoothness of $w_i$. Instead, we leverage the Sobolev framework [28] to
impose regularity in a naturally coarse-to-fine framework, while allowing
the warps to be arbitrary diffeomorphisms (smooth maps with a smooth
inverse). So, rather than adding a regularizer for the warps in (1), we
compute each warp as the integral of a smooth time-varying vector field
that, at each instant, belongs to a Sobolev space. This allows us to
efficiently optimize (1) without imposing global regularization, which may
be too much for fine-scale objects, and too little for large ones.

Given the warp $w_i$, the optimal occlusion $O_i$ is

$O_i = \{x \in R_i : |I_{t+1}(w_i(x)) - I_t(x)|^2 > \beta\}$.  (2)

Substituting the expression above into the energy, we obtain

$E_{\text{warp}}(w_i) = \int_{R_i} \rho(I_{t+1}(w_i(x)) - I_t(x)) \, dx$,  (3)

which now depends only on the warp $w_i$, and where

$\rho(y) = |y|^2$ for $|y|^2 < \beta$ and $\rho(y) = \beta$ for $|y|^2 \geq \beta$.  (4)
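As a small illustration of (2)-(4), the sketch below evaluates the truncated penalty and the occlusion set on arrays. It assumes the next frame has already been warped back to frame $t$ (so that `I_next_warped[x]` stands for $I_{t+1}(w_i(x))$), that images are float arrays of shape (height, width, 3), and that the region is a boolean mask; the function names are hypothetical.

```python
import numpy as np

def rho(sq_residual, beta):
    """Truncated penalty of Eq. (4): the squared residual, capped at beta."""
    return np.minimum(sq_residual, beta)

def occlusion_mask(I_next_warped, I_cur, region, beta):
    """Eq. (2): pixels of `region` whose squared color residual exceeds beta.

    Returns the boolean occlusion mask and the per-pixel squared residual.
    """
    sq_res = np.sum((I_next_warped - I_cur) ** 2, axis=-1)
    return region & (sq_res > beta), sq_res
```

Summing `rho(sq_res, beta)` over the region gives a discrete analogue of (3): occluded pixels each contribute the constant $\beta$, and co-visible pixels contribute their squared residual.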

With this, we can finally clarify the notion of “coherent motion” used to
define the regions $R_i$. A region $R_i$ moves coherently if there is a
warp $w_i$ that is smooth according to the Sobolev metric and that
(locally) minimizes (3).

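The gradient of $E_{\text{warp}}$, $G_i : w_i(R_i) \to \mathbb{R}^2$, with
respect to the Sobolev metric has been computed by [38] and is

$G_i(x) = \nabla_{\text{Sob}} E(w_i)(x) = \text{avg}(F_i) + \frac{1}{\alpha} \tilde{G}_i(x)$,  (5)

where $\alpha > 0$ is a parameter that will be eliminated below,
$F_i : w_i(R_i) \to \mathbb{R}^2$ is

$F_i = \nabla I_{t+1} \, \nabla\rho(I_{t+1} - I_t \circ w_i^{-1}) \, \det \nabla w_i^{-1}$,  (6)

$\text{avg}(F_i)$ is the average of $F_i$ over $w_i(R_i)$, $\nabla$ is the
vector of partials, and $\tilde{G}_i$ satisfies the partial differential
equation (PDE):

$-\Delta \tilde{G}_i(x) = F_i(x) - \text{avg}(F_i)$,  $x \in w_i(R_i)$
$\nabla \tilde{G}_i(x) \cdot N = 0$,  $x \in \partial w_i(R_i)$   (7)
$\text{avg}(\tilde{G}_i) = 0$

where $\Delta$ is the Laplacian, $N$ is normal to $\partial w_i(R_i)$,
$\tilde{G}_i$ is the deformation, and $\text{avg}(F_i)$ is the translation.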
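As a rough illustration of the decomposition in (5)-(7), the following sketch splits one scalar component of $F_i$ into its average (the translation part) and a zero-mean field solving the Neumann problem (7). It is only a sketch under simplifying assumptions: a small rectangular grid stands in for $w_i(R_i)$, the Laplacian is the standard 5-point stencil with zero-flux boundary, and the system is solved by dense least squares. The function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def neumann_laplacian(h, w):
    """Dense 5-point Laplacian on an h-by-w grid with zero-flux boundary."""
    n = h * w
    L = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            k = r * w + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:   # neighbours outside are dropped
                    L[k, rr * w + cc] += 1.0
                    L[k, k] -= 1.0
    return L

def split_gradient(F):
    """Return avg(F) and the zero-mean G_tilde with -Lap(G_tilde) = F - avg(F)."""
    h, w = F.shape
    avg = F.mean()
    rhs = (F - avg).ravel()
    # The Laplacian is singular (constants lie in its null space); the
    # minimum-norm least-squares solution is the one with zero average.
    g, *_ = np.linalg.lstsq(-neumann_laplacian(h, w), rhs, rcond=None)
    return avg, g.reshape(h, w)
```

Since $F_i$ takes values in $\mathbb{R}^2$, the routine would be applied to each component separately. A dense solve is only practical for tiny grids; it is used here because it makes the zero-average constraint in (7) easy to verify.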

To extend the framework to multiple regions, we extend each warp $w_i$ to
the entire domain $D$ by imposing $\Delta G_i(x) = 0$ for
$x \in D \setminus R_i$ and a Dirichlet condition on $\partial R_i$. The
extension is continuous, but not differentiable across $\partial R_i$.1

1 While one can define the Sobolev metric over the entire domain $D$ [4],
thus naturally having a regular gradient defined over the entire domain
$D$, this is avoided to enable capturing fine-scale structures in a manner
that is not influenced by neighboring large-scale structures, for instance
an arm swinging near the torso of a person.


Starting with the identity map $w_i(x) = x$, we deform it by the gradient
descent (5) as follows. Define $\phi_i^{0,\tau} : D \to D$ and
$\phi_i^{\tau,0} : D \to D$ as the evolving warp and its inverse, where
$\tau$ is an artificial time variable parameterizing the evolution. The
inverse is needed to compute $F_i$. The evolution of the warps according to
the gradient descent of $E_{\text{warp}}$ is

$G_i^\tau = \nabla_{\text{Sob}} E_{\text{warp}}(\phi_i^{0,\tau})$,  (8)
$\partial_\tau \phi_i^{\tau,0}(x) = \nabla \phi_i^{\tau,0}(x) \cdot G_i^\tau(x)$,  (9)
$\partial_\tau \phi_i^{0,\tau}(x) = -G_i^\tau(\phi_i^{0,\tau}(x))$  (10)

for all $x \in D$. This gives a coarse-to-fine evolution. One can eliminate
the parameter $\alpha$ by noting that the deformation and translation
components in (5) do not depend on $\alpha$. This gives Algorithm 1, which
decreases the energy.

Algorithm 1 Sobolev Warp Computation

1: Set $\phi_i^{\tau,0}(x) = \phi_i^{0,\tau}(x) = x$ for $\tau = 0$
2: repeat
3:   repeat
4:     Let $\alpha \to \infty$ so $G_i^\tau = \text{avg}(F_i^\tau)$ is a translation
5:     Translate: Perform one iteration of (9)-(10)
6:   until $\text{avg}(F_i^\tau) = 0$
7:   Deform: Do one iteration of (9)-(10) with $G_i^\tau = \tilde{G}_i^\tau$
8: until $\tilde{G}_i^\tau = 0$
9: Set $w_i = \phi_i^{0,\tau_\infty}$, where $\tau_\infty$ is the convergence time


In Section 3.3, we will need to compute the occlusion so that it can be
removed in the next frame. It can be computed at the end of the evolution
as

$O_i^{\tau_\infty} = \{x \in R_i : |I_{t+1}(\phi_i^{0,\tau_\infty}(x)) - I_t(x)|^2 > \beta\}$.  (11)

3.  Causal  Object  Segmentation 

If the motion of each region $R_i$ were reliably inferred, one could
attempt to propagate the $R_i$ forward to segment the next frame.
Unfortunately, regions that become disoccluded between $t$ and $t+1$ are
not included in any of the $R_i$. While this is not a major problem if we
are interested in only two adjacent frames, $t$ and $t+1$, since the area
of the occluded/disoccluded regions is small, the disocclusion typically
grows as time goes by. Thus, this phenomenon is hard to ignore when one
considers long temporal sequences. The challenge becomes to assign the
various components of the disocclusion to existing regions, or to spawn new
ones. This is illustrated in Fig. 1: so long as the scene is populated by
non-flat surfaces, multiple objects contribute to the disoccluded region.

We assume a partition into objects at time $t-1$ and propagate it forward
to time $t$. The disocclusion, i.e., the part of the domain $D$ not covered
by the propagated segmentation, is initially assigned to regions based on
estimated warps, and this is refined by minimizing the energy in
Section 3.1.

[Figure 2 panels: image at frame t; object moves left at frame t+1]

Figure 2. Illustration of an error that arises in segmentation by grouping
pixels only based on motion residuals. The object (dark yellow) moves to
the left to occlude a portion of the background (dark green). Pixels in the
occluded region are likely to be classified incorrectly in frame t if only
motion residuals are used, since both residuals are large. When the
background is constant in the occluded region and around it, classifying by
residuals almost certainly leads to misclassifications.


3.1.  Complementarity  of  Motion  and  Appearance 

Of course, both appearance and motion cues are obtained from image
irradiance. What we mean by “cues” is bottom-up computation that leverages
the assumption of smooth spatial variation of image irradiance (appearance
cues) versus smooth temporal variation of the same (motion cues).

To attribute disoccluded regions to any of the existing objects, we can
leverage the photometric regularity and assign each segment to the object
that has similar “texture” or motion. We favor the latter, as objects can
have spatially-varying appearance, as in the cereal box in Fig. 1. This
fails when the object and the background are textureless, as in Figure 2,
or when they exhibit similar fine-scale texture. However, in this case
grouping by appearance is straightforward. We leverage this complementarity
by preferentially exploiting motion regularity, consistent with our
definition of objects, and resorting to appearance regularity when the
photometry is not suitable to reliably estimate motion.

Textureless regions: To leverage this complementarity, we use the local
standard deviation $\sigma_i(x)$ of $I_t$ in a neighborhood
$B_{x,r'} \cap R_i$, where $B_{x,r'} = \{y \in D : |x - y| < r'\}$ is the
ball of radius $r'$ centered at point $x$. We can then define a measure of
local constancy of any region local to a point $x$ as the minimum standard
deviation over all regions that intersect the ball:

$\sigma(x) = \min_{i,\, B_{x,r'} \cap R_i \neq \emptyset} \sigma_i(x)$.  (12)

Low values of $\sigma(x)$ indicate that the underlying color channels are
not sufficiently exciting and therefore motion estimates can be expected to
be unreliable.
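The local constancy measure (12) can be sketched as follows. This is a minimal illustration that assumes a single-channel (grayscale) image, boolean region masks, and pixel coordinates given as (row, column); how the standard deviation is aggregated over color channels is not specified here, so the grayscale simplification and the function names are assumptions.

```python
import numpy as np

def local_std(image, region, x, r):
    """Std of `image` over pixels of `region` strictly within distance r of x.

    Returns None when the ball around x does not intersect the region.
    """
    h, w = image.shape
    rows, cols = np.ogrid[:h, :w]
    ball = (rows - x[0]) ** 2 + (cols - x[1]) ** 2 < r ** 2
    sel = ball & region
    if not sel.any():
        return None
    return float(image[sel].std())

def local_constancy(image, regions, x, r):
    """Eq. (12): minimum local std over the regions that intersect the ball."""
    vals = [local_std(image, R, x, r) for R in regions]
    return min(v for v in vals if v is not None)
```

A pixel on the border between a constant region and a textured one receives a value of zero, which is the situation in which motion estimates are flagged as unreliable.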

Motion ambiguity function: Grouping by residuals should also be avoided
when the current warp residuals are large. Define the forward, backward,
and minimum residuals as

$\text{Res}_i^f(x) = |I_{t+1}(w_i^f(x)) - I_t(x)|^2$  (13)
$\text{Res}_i^b(x) = |I_t(w_i^b(x)) - I_{t-1}(x)|^2$  (14)
$\text{Res}_i(x) = \min\{\text{Res}_i^f(x), \text{Res}_i^b(x)\}$  (15)

where $w_i^f$ and $w_i^b$ are the current forward and backward warps of
region $R_i$. The backward residual is used to remove some of the ambiguity
in Fig. 2, since occluded pixels at time $t+1$ are sometimes visible at
time $t-1$, and hence the backward motion may be reliable. The minimum of
$\text{Res}_i$ over all regions that intersect a ball around $x$,

$\text{Res}(x) = \min_{i,\, B_{x,r'} \cap R_i \neq \emptyset} \text{Res}_i(x)$,  (16)

is small when motion cues are reliable. We define the motion ambiguity
function, $\text{maf} : D \to \{0, 1\}$, which indicates whether motion
cues are unreliable, as

$\text{maf}(x) = 1$ if $\sigma(x) < k/r'$ or $\text{Res}(x) > \beta$, and $\text{maf}(x) = 0$ otherwise,  (17)


where $k > 0$ is a parameter, the sensitivity to which is studied
empirically in Sect. 4. maf is 1 if the pixel is in or borders a constant
region, or if all motion residuals are large.

Complementary data term: The cost for $x \in R_i$ is

$f_i(x) = (1 - \text{maf}(x)) \, \text{Res}_i(x) - \text{maf}(x) \log p_{i,x}(I_t(x))$,  (18)


where $p_{i,x}$ are local normalized color histograms of the image $I_t$
within the region $R_i$. Therefore, if the motion is reliable, as defined
by the maf, the cost is the residual of the pixel in the region; if the
motion is unreliable, the cost is the fidelity of the pixel to the local
intensity distribution of the region $R_i$. The data energy for region
$R_i$ is then

$E_{\text{data}} = \int_{R_i} f_i(x) \, dx$.  (19)

This  complementary  data  term  is  a  key  feature  in  resolving 
disocclusions  (Fig.  3). 
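The switch between the two cues in (17)-(18) can be sketched with array operations. The sketch assumes that the maps $\sigma(x)$, $\text{Res}(x)$, $\text{Res}_i(x)$ and the histogram likelihood $p_{i,x}(I_t(x))$ have already been computed as arrays of the same shape; the function names and the small constant guarding the logarithm are assumptions made for illustration.

```python
import numpy as np

def motion_ambiguity(sigma, res_min, k, r, beta):
    """Eq. (17): 1 where texture is too weak or every residual is too large."""
    return ((sigma < k / r) | (res_min > beta)).astype(float)

def data_cost(maf, res_i, p_i):
    """Eq. (18): residual where motion is reliable, -log likelihood otherwise."""
    eps = 1e-12  # assumption: avoids log(0) for empty histogram bins
    return (1.0 - maf) * res_i - maf * np.log(p_i + eps)
```

Because maf is binary, each pixel is charged by exactly one cue: the motion residual when maf is 0 and the appearance likelihood when maf is 1.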

3.2.  Temporal  and  Spatial  Regularity 

To leverage temporal and spatial regularity of the regions, we first note
that the warps are regular by construction within the Sobolev framework. We
also note that, in between frames, disoccluded regions are small, adjacent
to the object they belong to, and typically result in an updated region of
similar shape. Thus, if $R_i'$ is the forward warping of the ith region
from frame $t$ to $t+1$, we bias the final regions $R_i$ to be close to
$R_i'$ in shape and location.

To this end, we construct a local shape similarity prior. Measuring the
similarity of $R_i$ and $R_i'$ generally requires knowledge of point
correspondences. Similar to ICP [5], we assume that $x \in R_i$ corresponds
to its closest point in $R_i'$, $\text{cl}_i(x)$, which can be computed
efficiently with Fast Marching [24]. Define the local shape similarity,
$S_i : R_i \to \mathbb{R}^+$, of $R_i$ within the ball $B_{x,r}$ to $R_i'$
within $B_{\text{cl}_i(x),r}$, as follows:

$S_i(x) = \frac{1}{|B_{x,r}|} \int_{B_{x,r}} |1_{R_i}(y) - 1_{R_i'}(\text{cl}_i(x) - x + y)| \, dy$,  (20)

where $1_R$ is the indicator function of $R$, and $|B_{x,r}|$ is the area
of $B_{x,r}$ (see Fig. 4). The score measures the difference between the
shapes $R_i \cap B_{x,r}$ and $R_i' \cap B_{\text{cl}_i(x),r}$ using the
translation-invariant set symmetric difference. The shape similarity energy
is:

$E_{\text{shape}} = \int_{R_i} S_i(x) \, dx$.  (21)

[Figure 3 panels: disocclusion assignment with appearance only [38];
disocclusion with complementary motion and appearance (ours)]

Figure 3. Rotating around an object. Disoccluded parts of an object that
have different appearance than the visible parts in the previous frame
(cereal box) pose difficulties to existing algorithms. Labeled above are
various strategies for addressing disocclusions. Our method also performs
well under self-similar appearance (statue), and handles various visibility
artifacts from non-convex objects.
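To make the local shape similarity (20) concrete, the sketch below evaluates it at a single pixel on boolean masks. It assumes integer pixel coordinates (row, column), a discrete disk for the ball, a non-empty propagated region, and a brute-force nearest-point search that stands in for the Fast Marching computation mentioned in the text; the function names are hypothetical.

```python
import numpy as np

def inside(mask, p):
    """Indicator of `mask` at pixel p; pixels outside the image count as 0."""
    r, c = p
    return bool(0 <= r < mask.shape[0] and 0 <= c < mask.shape[1] and mask[r, c])

def shape_similarity(R, R_prop, x, r):
    """Eq. (20) at pixel x: fraction of the ball where R and shifted R_prop differ."""
    pts = np.argwhere(R_prop)                      # all pixels of the propagated region
    cl = pts[np.argmin(((pts - np.array(x)) ** 2).sum(axis=1))]  # closest point cl(x)
    shift = cl - np.array(x)                       # translation cl(x) - x
    diff, area = 0, 0
    for dr in range(-r, r + 1):
        for dc in range(-r, r + 1):
            if dr * dr + dc * dc > r * r:          # keep only the discrete disk
                continue
            y = (x[0] + dr, x[1] + dc)
            z = (y[0] + shift[0], y[1] + shift[1])
            diff += int(inside(R, y) != inside(R_prop, z))
            area += 1
    return diff / area
```

When the candidate region is a pure translation of the propagated one, the score at a boundary pixel is zero, which reflects the translation invariance noted above.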

In addition, to bias regions $R_i$ towards being close to $R_i'$, let
$d_{R_i'}$ denote the distance function to $\partial R_i'$, and define

$E_{\text{dist}} = \int_{R_i} d_{R_i'}(x) \, dx$.  (22)

Finally, we induce spatial regularity of $R_i$, i.e., nearby points $x$ and
$y$ are penalized if they do not belong to the same region. Let

$W_{R_i} = G_s * (1 - 1_{R_i})$  (23)

be a Gaussian smoothing, with standard deviation $s$, of the complement of
the indicator function of $R_i$ [11]. A large value of $W_{R_i}(x)$ implies
that $x \in R_i$ is near many points of $D \setminus R_i$. We induce
spatial regularity of $R_i$ by

$E_{\text{smooth}} = \int_{R_i} W_{R_i}(x) \, dx$.  (24)

Figure 4. Illustration of the quantities in the local shape similarity
term, $S_i$. $R_i'$ is the forward warped region and $R_i$ is a candidate
in frame t + 1. The region $R_i$ in a ball around $x$ is compared to $R_i'$
in a ball around $\text{cl}_i(x)$, the closest point on $R_i'$ to $x$, to
form $S_i(x)$.


3.3.  Overall  Model  and  Optimization  Method 

The assumptions underlying our model are captured by the following energy,
which is minimized with respect to the regions $R_i$:

$\sum_{i=1}^{N} E_{\text{data}} + \gamma_{ls} E_{\text{shape}} + \gamma_d E_{\text{dist}} + \gamma_s E_{\text{smooth}}$,  (25)

where $\gamma_{ls}, \gamma_d, \gamma_s > 0$ are weights. We optimize the
energy above by a first order approximation to the gradient descent,
ignoring terms that involve integrals over $R_i$. They could be easily
included, at a high computational cost and modest performance gain. By
defining

$H_i(x) = f_i(x) + \gamma_{ls} S_i(x) + \gamma_d d_{R_i'}(x) + \gamma_s W_{R_i}(x)$,  (26)

we arrive at our optimization scheme in Algorithm 2.


Algorithm 2 Assigning Disocclusion to Regions

1: // initialize $R_i$ for gradient descent
2: Compute propagation of segmentation, $R_i'$, using (27)
3: Compute disocclusion $\tilde{D} = D \setminus \cup_i R_i'$
4: Compute warps of $R_i'$ using Algorithm 1
5: Compute $H_i$ by substituting $R_i$ with $R_i' \cup \tilde{D}$
6: Set $R_i = R_i' \cup \{x \in \tilde{D} : H_i(x) < H_j(x),\ \forall j \neq i\}$
7: // end initialize
8: repeat // first order approximation of gradient descent
9:   Update warps of $R_i$ using Algorithm 1
10:  Compute $H_i$
11:  $R_i^{\text{new}} = \{x \in D : d_{R_i}(x) < \varepsilon,\ H_i(x) < H_j(x)\ \forall j \neq i\}$
12:  Update regions by $R_i = R_i^{\text{new}}$
13: until the $R_i$'s do not change between iterations
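The two relabeling steps of Algorithm 2 (line 6 and lines 10-12) both reduce to picking, per pixel, the region with the smallest $H_i$. A minimal sketch follows, assuming the costs are stacked in an array `H` of shape (N, height, width), propagated labels use -1 for disoccluded pixels, and the band $d_{R_i}(x) < \varepsilon$ is supplied as a precomputed boolean mask. These conventions and the function names are assumptions, and the warp updates of line 9 are not shown.

```python
import numpy as np

def assign_disocclusion(prop_labels, H):
    """Line 6: keep propagated labels; label disoccluded pixels by smallest H_i."""
    labels = prop_labels.copy()
    dis = prop_labels < 0                          # disocclusion marked with -1
    labels[dis] = np.argmin(H, axis=0)[dis]
    return labels

def band_update(labels, H, band):
    """Lines 10-12: relabel only pixels inside the band near region boundaries."""
    new = labels.copy()
    new[band] = np.argmin(H, axis=0)[band]
    return new
```

Ties are broken here in favor of the lowest region index, whereas the algorithm states a strict inequality; in practice exact ties between real-valued costs are rare.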


Algorithm 2 first computes an initialization of the regions $R_i$ for the
gradient descent (lines 2-6). This is accomplished by propagating forward
the segmentation at time $t-1$ to $t$:

$R_i' = \{x \in D : 1_{R_i \setminus O_i}(w_i^{-1}(x)) > 1_{R_j \setminus O_j}(w_j^{-1}(x)),\ \forall j \neq i\}$,  (27)

where $O_i \subset R_i$ is the part of the ith region that is occluded at
frame $t$ (11), which is removed, and $w_i$ is the warp from $t-1$ to $t$.
The regions $R_i'$ do not partition all of $D$ because of disocclusion.
Therefore, the disoccluded region $\tilde{D} = D \setminus \cup_i R_i'$ is
initially assigned based on motion cues computed from $R_i'$ and the other
terms in $H_i$.

Figure 5. [Left]: Segmentation from frame t. [Middle, left]: the
propagation of the segmentation from frame t to t + 1 (black regions
indicate disoccluded regions). [Middle, right]: initialization of the
regions. [Right]: final segmentation.

Figure 6. Illustration of initialization method in the first frame.
[Left]: Aggregation of optical flow fields. [Right]: initial segmentation
in the first frame.
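The propagation step (27) can be illustrated on label images. The sketch assumes that each region's inverse warp is available as an array of shape (height, width, 2) holding, for every pixel of the new frame, the (row, column) it came from in the previous frame, and that nearest-neighbour rounding is an acceptable lookup. Pixels claimed by no region, or by more than one, are marked -1. The data layout and function name are assumptions.

```python
import numpy as np

def propagate(labels_prev, occluded, inv_warps):
    """Eq. (27): x gets label i when only region i, minus its occlusion, maps onto x."""
    h, w = labels_prev.shape
    claims = np.zeros((len(inv_warps), h, w), dtype=bool)
    for i, inv in enumerate(inv_warps):
        src = np.rint(inv).astype(int)             # nearest-neighbour source pixel
        r, c = src[..., 0], src[..., 1]
        ok = (r >= 0) & (r < h) & (c >= 0) & (c < w)
        rc, cc = np.clip(r, 0, h - 1), np.clip(c, 0, w - 1)
        visible = (labels_prev == i) & ~occluded   # region i minus its occlusion
        claims[i] = ok & visible[rc, cc]
    out = np.full((h, w), -1)
    unique = claims.sum(axis=0) == 1               # exactly one region claims x
    out[unique] = np.argmax(claims, axis=0)[unique]
    return out
```

The pixels left at -1 form the uncovered part of the domain, which is what the subsequent assignment step has to distribute among the regions.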

With this initialization, the first order approximation to the gradient
descent is computed (lines 9-12). Note that the condition
$d_{R_i}(x) < \varepsilon$ allows pixel changes only within a band around
the boundaries of the current regions, so as to approximate the gradient
descent. Each step of the warp computation (from $t$ to $t+1$ and from $t$
to $t-1$) in line 9 requires only a few iterations of Algorithm 1, since
the warps from the previous iteration of line 9 are close to the final
ones. See Fig. 5 for an example of various stages of this method.

3.4.  Initialization  for  the  First  Frame 

So far we have assumed that, at time $t$, we have a partition at time
$t-1$. This is the case during regime operation when processing a video
sequence, but not when $t = 0$. For certain applications, such as
interactive video segmentation [3, ], one can assume that the user provides
an initial partition. More generally, a number of methods could be employed
to obtain an initial partition, using a variety of cues, including semantic
labeling from trained detectors. While this process may be costly, it only
needs to be performed once, as our method affords us the ability to correct
initial errors based on motion and appearance regularity.

In the next section, we present results for an initialization performed by
clustering optical flow (with regularity (24), using Classic-NL [26])
during a longer initial temporal segment, until enough motion is observed
(see Fig. 6).


4.  Experiments 

Our algorithm aims to segment objects, thus we test it on benchmarks with
ground-truth object annotation: the Freiburg-Berkeley Motion Seg.
(FBMS-59) [23], and SegTrack (v1 & v2) [31, 20]. FBMS-59's two sets,
training (29 sequences) and test (30 sequences), range between 19 and 800
frames with multiple objects. SegTrack v2 consists of 14 sequences ranging
from 29 to 279 frames with multiple objects. SegTrack v1 is an earlier
version with single objects, which we use to expand the comparison to more
methods.

Evaluation: FBMS-59 scores a subset of frames (3-41). Results are reported
in terms of precision, recall, F-measure, and the number of objects with
F > 0.75. SegTrack (v1 & v2) evaluates on all frames: results on v1 are
reported as the number of pixels incorrectly classified, and results on v2
as average intersection-over-union overlap.
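For reference, the per-mask quantities named above can be computed as in the following sketch. It assumes a single predicted mask already matched to a single ground-truth mask, both boolean arrays; the benchmarks' own protocols for matching regions to objects and averaging over frames are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def precision_recall_f(pred, gt):
    """Precision, recall and F-measure of a predicted mask against ground truth."""
    tp = np.logical_and(pred, gt).sum()
    p = tp / pred.sum() if pred.sum() else 0.0
    r = tp / gt.sum() if gt.sum() else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def iou(pred, gt):
    """Intersection over union of two masks (1.0 when both are empty)."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def pixel_errors(pred, gt):
    """Number of pixels on which the two masks disagree."""
    return int(np.logical_xor(pred, gt).sum())
```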

Comparisons: On FBMS-59, we compare against a baseline approach [14], one
based on clustering motion tracks [23], one segmenting based on occlusion,
motion, and appearance cues [1], and finally a recent one integrating
motion, appearance, occlusion, and temporal regularity [30]. On SegTrack,
we compare to [8], which attempts to solve disocclusions using only
appearance, and to other state-of-the-art methods [20, 19, 21, 16, 34].

Initialization: On FBMS-59, we report results of our method automatically
initialized as described in Sect. 3.4. On SegTrack, our method is
initialized by the user in frame 1 and compared with similarly initialized
methods and also automated methods. Typically, sequences in SegTrack do not
have enough object motion in the first few frames to ensure proper
initialization.

Parameters: For FBMS-59, we tune the parameters on a few sequences in the
training dataset, and then fix them on training and test datasets. On
SegTrack, parameters are fixed. Parameters consistent across datasets are
$\gamma_{ls} = 0.1$, $\gamma_d = 0.001$, $\gamma_s = 5$. Sensitivity of key
parameters is addressed later.

Results on FBMS-59 are in Table 1. Figure 7 shows some representative
outcomes. Overall, our method is more accurate, even compared to non-causal
(NC) methods that process the video in batch. This suggests that handling
disocclusion well is key to accurate object segmentation.

Failure Cases on FBMS-59: The main source of error is the
automatic initialization in frame 1. This could be mitigated by
running our method on multiple candidate initializations, although
initialization is not our focus here. To show that better
initialization would resolve failures, we show that the results of
the 10 most inaccurate cases (typically when an object failed to be
detected) improve with user annotation in the first frame (Table 2,
Figure 8). Fig. 9 shows that our method recovers from errors in the
first frame (short of failed detection).
[Figure 7 columns: image, ground truth, Lee et al. [19], Grundman et al. [1 ], Ochs et al. [23], Taylor et al. [30], ours]


Figure 7. Sample Visual Results on FBMS-59. Comparison of various state-of-the-art methods. Only a single frame from each of various sequences
is shown. Failure cases (bottom two) in our method typically arise when not enough motion is present in the first few frames.


           Training set (29 sequences)      Test set (30 sequences)
           P       R       F       N/65     P       R       F       N/69
[14]       79.17   47.55   59.42   4        77.11   42.99   55.20   5
[23]       81.50   63.23   71.21   16       74.91   60.14   66.72   20
[1]        87.20   59.60   70.81   17       79.64   50.73   61.98   7
[30]       85.00   67.99   75.55   21       82.37   58.37   68.32   17
[30]-NC    83.00   70.10   76.01   23       77.94   59.14   67.25   15
ours       89.53   70.74   79.03   26       91.47   64.75   75.82   27

Table 1. FBMS-59 results. Average precision (P), recall (R),
F-measure (F), and number of objects detected (N) over all
sequences in the training and test datasets of FBMS-59. Higher values
indicate superior performance. All methods are fully automatic.
[1], [30] and our method are causal; other methods are not.

                marple9   cats4    farm1    goats1   giraffes1   all
ours (auto)     0.7950    0.7723   0.6730   0.6166   0.7515      0.7217
ours (manual)   0.9782    0.9025   0.7519   0.7505   0.9255      0.8617

Table 2. Failure cases on FBMS-59 in Fig. 7 improve with user
annotation in the first frame. Thus, the main source of error in
our method is the initialization. Results are in terms of F-measure.


Figure 8. Sample failure cases (various frames) on FBMS-59 in
Fig. 7 improve with user annotation in the first frame.


Figure 9. Results (on FBMS-59) with four different levels of errors
in initialization. Errors are mitigated in subsequent frames.


Forward-Backward Sweeps on FBMS-59: Although disocclusions are
backward-occlusions, addressed extensively in the literature [29, 1],
computing disocclusions via forward-backward sweeps followed by a
grouping procedure does not perform as well as our method. We compare
to the non-causal version of [30], which consists of one forward and
one backward pass, followed by advanced grouping based on motion,
appearance, temporal continuity, and constraints imposed by
occlusions/disocclusions. The result, labeled [30]-NC in Table 1, is
worse than ours on all measures. This reaffirms that forward-backward
sweeps are not an adequate approach to resolving disocclusions.

Results on SegTrack: Results on v1 are in Table 3. We let the user
annotate the first frame, as in [16, 8, 34]. Our method outperforms
all others on all but one sequence. That our method outperforms [8]
reaffirms that exploiting complementary motion and appearance cues
is beneficial. Results on v2 (Table 4, Fig. 10) show that our method
outperforms, on average, not only fully automated methods but also
those using user annotation.
            human   ours   [34]   [16]   [8]    [21]   [19]
Mean        347     409    535    874    455    677*   740*
Birdfall    130     144    163    189    265    189    288
Cheetah     308     623    806    1170   570    806    905
Girl        762     835    1904   2883   841    1698   1785
Monkeydog   306     252    342    333    289    472    521
Parachute   299     169    275    228    310    221    201
Penguin     279     429    571    443    456    -      136285

Table 3. SegTrack v1 results. Evaluation is performed in terms of
the number of pixels classified incorrectly; smaller values indicate
superior results. Note that our method, [34], [16], and [8] use user
annotation in frame 1, and [21], [19] do not.


                     ours   [34]   [20]   [19]   [14]
Mean per object      76.4   71.8   65.9   45.3   51.8
Mean per sequence    77.0   72.2   71.2   57.3   50.8
Girl                 91.6   84.6   89.2   87.7   31.9
Birdfall             77.3   78.7   62.5   49.0   57.4
Parachute            96.1   94.4   93.4   96.3   69.1
CheetahDeer          62.4   66.1   37.3   44.5   18.8
CheetahCheetah       52.2   35.3   40.9   11.7   24.4
Monkeydog-Monkey     84.1   82.2   71.3   74.3   68.3
Monkeydog-Dog        43.7   21.1   18.9   4.9    18.8
Penguin1             94.0   94.2   51.5   12.6   72.0
Penguin2             82.1   91.8   76.5   11.3   80.7
Penguin3             78.4   91.9   75.2   11.3   75.2
Penguin4             86.3   90.3   57.8   7.7    80.6
Penguin5             77.1   76.3   66.7   4.2    62.7
Penguin6             89.0   88.7   50.2   8.5    75.5
Drifting Car1        82.3   67.3   74.8   63.7   55.2
Drifting Car2        77.6   63.7   60.6   30.1   27.2
Hummingbird1         39.0   58.3   54.4   46.3   13.7
Hummingbird2         69.0   50.7   72.3   74.0   25.2
Frog                 76.7   56.3   72.3   0      67.1
Worm                 83.4   79.3   82.8   84.4   34.7
Soldier              84.0   81.1   83.8   66.6   66.5
Monkey               85.1   86.0   84.8   79.0   61.9
Bird of Paradise     96.1   93.0   94.0   92.2   86.8
BMXPerson            92.8   88.9   85.4   87.4   39.2
BMXBike              32.5   5.70   24.9   38.6   32.5

Table 4. SegTrack v2. The evaluation is performed in terms of
the overlap of the best segments; larger values indicate superior
results. Our method and [34] use user annotation in frame 1.


Figure  10.  Sample  SegTrack  v2  results  of  our  method. 


Sensitivity to Key Parameters: The key parameters are the ball
size r' and the threshold parameter k in our textureless region
detector, (12) and (17). To assess sensitivity, we plot PR curves
(measured in terms of correctly/incorrectly classified pixels),
fixing one parameter while varying the other, and vice versa.
Results (Fig. 11) on the cereal box and statue sequences show that,
within the operating range, precision does not drop much as recall
increases.
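
A minimal sketch of how such a curve can be traced for one parameter, assuming a hypothetical per-pixel detector response in `scores` and ground-truth binary labels in `labels`. The thresholding direction (a pixel is labeled positive when its score is at least k) and all names are assumptions for illustration. The detector of (12) and (17) is not reproduced.

```python
def pr_curve(scores, labels, thresholds):
    """Return (threshold, precision, recall) triples obtained by
    thresholding the scores at each value in `thresholds`."""
    curve = []
    for k in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= k and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= k and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < k and y)
        # Convention: precision is 1.0 when nothing is predicted positive.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        curve.append((k, precision, recall))
    return curve


# Toy data: five pixels with hypothetical detector scores.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
curve = pr_curve(scores, labels, thresholds=[0.5, 0.3])
for k, precision, recall in curve:
    print(k, round(precision, 3), round(recall, 3))
```

Sweeping the second parameter works the same way: recompute the scores for each ball size with the threshold held fixed, and record one precision-recall pair per setting.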


Figure 11. Analysis of sensitivity of key parameters (the threshold
and ball size of the textureless detector). [Left]: Precision vs.
recall curve fixing the ball size and varying the threshold. [Right]:
Precision vs. recall curve fixing the threshold and varying the ball
size.

Computational cost and implementation: Our unoptimized C++
implementation is available (see footnote 2). The costliest
component is solving for the warps. This requires solving a linear
PDE, for which many fast solvers are available and could be
leveraged. We used conjugate gradient, which can be sped up. The
overall cost of our algorithm varies with the amount of deformation
between frames. Using a 3.1GHz 12-core processor (with
parallelization for the gradient (7)), processing one frame on
FBMS-59 takes 30 seconds on average.
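
For readers unfamiliar with the solver, the following is a minimal matrix-free conjugate gradient sketch. The operator used here, (I - alpha * Laplacian) on a 1-D grid with zero-flux boundaries, is only an assumed stand-in for a symmetric positive-definite discretized linear PDE. It is not the warp equation of the paper, and the names `apply_operator` and `alpha` are illustrative.

```python
def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))


def apply_operator(u, alpha):
    """Apply (I - alpha * Laplacian) on a 1-D grid, replicating the end
    values so that the boundary flux is zero. The resulting matrix is
    symmetric positive definite for alpha >= 0."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else u[i]
        right = u[i + 1] if i < n - 1 else u[i]
        out.append(u[i] - alpha * (left - 2.0 * u[i] + right))
    return out


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given only the
    matrix-vector product `apply_a`. Stops when the residual norm falls
    below tol times the norm of b."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    stop = tol * tol * dot(b, b)
    for _ in range(max_iter):
        if rs_old <= stop:
            break
        ap = apply_a(p)
        step = rs_old / dot(p, ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def operator(v):
    return apply_operator(v, alpha=0.5)


f = [1.0, 0.0, 0.0, 0.0, 2.0]
u = conjugate_gradient(operator, f)
residual = max(abs(a - b) for a, b in zip(operator(u), f))
print(residual)
```

Because only matrix-vector products are needed, the same routine applies to 2-D image grids by swapping in a different `apply_a`. Preconditioning or multigrid are typical ways to speed up such solves.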

5.  Discussion 

We propose a method for handling disocclusion in object tracking
that does not require explicit motion regularization, operates
naturally in a coarse-to-fine framework, and leverages complementary
motion and appearance cues. Our method depends less on tuning
parameters than competing ones, and mitigates typical failure modes.

Our approach assumes that a current estimate of the partition into
objects is given at time t in order to infer the partition at t + 1.
If the given partition is nonsensical, the output of our inference
scheme will most likely be nonsensical as well. This issue is
particularly acute at time t = 0. It can be addressed by spawning
multiple trackers corresponding to different initialization
hypotheses and later aggregating them through a voting scheme. In
many tracking applications, however, the user decides what to track,
so at least a rough initial partition is available. This is the case
for interactive video post-processing [3]. In this case, it would be
best to process the entire sequence non-causally, although in some
cases processing a sliding batch is still desirable to avoid
excessive delay in the interaction with the user.
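
One simple way to aggregate several hypotheses is a per-pixel majority vote, sketched below. The text does not specify the voting scheme, so this is an assumption for illustration. It further assumes that label ids already correspond across trackers, which in practice would require a matching step that is not shown.

```python
from collections import Counter


def vote_labels(label_maps):
    """Fuse several label maps (equally sized 2-D lists of integer
    object ids) by taking the most frequent label at every pixel.
    Ties go to the label that appears first in the list of maps."""
    rows = len(label_maps[0])
    cols = len(label_maps[0][0])
    fused = []
    for i in range(rows):
        fused_row = []
        for j in range(cols):
            votes = Counter(m[i][j] for m in label_maps)
            fused_row.append(votes.most_common(1)[0][0])
        fused.append(fused_row)
    return fused


# Three hypothetical trackers started from different initializations.
tracker_a = [[0, 1], [1, 1]]
tracker_b = [[0, 1], [0, 1]]
tracker_c = [[1, 1], [0, 2]]
fused = vote_labels([tracker_a, tracker_b, tracker_c])
print(fused)
```

Using an odd number of hypotheses avoids most ties for two-label problems. Weighting votes by each tracker's energy would be a natural refinement.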

Real-time  operation  remains  a  challenge,  but  our  method 
has  potential  since  we  process  data  causally,  and  we  use 
optimization  methods  that  are  rapidly  evolving,  so  we  can 
benefit  from  their  improvements. 